Performance optimization strategy of distributed storage for industrial time series big data based on HBase
Li YANG, Jianting CHEN, Yang XIANG
Journal of Computer Applications    2023, 43 (3): 759-766.   DOI: 10.11772/j.issn.1001-9081.2022020211

In automated industrial scenarios, the volume of time series log data generated by large numbers of industrial devices has exploded, and business demand for access to these time series data keeps growing. Although HBase, a distributed column-family database, can store industrial time series big data, existing strategies cannot satisfy the specific access requirements of industrial time series data well, because they ignore the correlation between the data and the access behavior characteristics of specific business scenarios. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data was proposed on the basis of the distributed storage system HBase, exploiting the correlation between data and access behavior characteristics in industrial scenarios. First, aiming at the load tilt caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot/cold data partitioning and access behavior classification was proposed: the data were classified into hot and cold ones by a Logistic Regression (LR) model, and the hot data were distributed across different nodes. Second, to further reduce cross-node communication overhead in the storage cluster and improve query efficiency for the high-dimensional index of industrial time series data, a strategy of placing the index and its main data in the same Region was proposed: by designing the index RowKey field and splicing rules, each index entry was stored in the same Region as its corresponding main data. Experimental results on real industrial time series data show that with the proposed optimization strategy, the tilt degree of the data load distribution is reduced by 28.5% and query efficiency is improved by 27.7%, demonstrating that the strategy can effectively mine access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.
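Below is a minimal, hypothetical sketch of the Region co-location idea: the index RowKey is spliced so that it shares its salt-and-device prefix with the main data RowKey, so HBase's range partitioning keeps index and main data in the same Region. All field names and splicing rules here are illustrative assumptions, not the paper's actual design.

```python
import zlib

def salt(device_id: str, buckets: int = 16) -> str:
    """Deterministic salt bucket; spreading devices over buckets balances load."""
    return f"{zlib.crc32(device_id.encode()) % buckets:02d}"

def main_rowkey(device_id: str, ts_ms: int) -> str:
    # main data row: <salt>|<device>|<timestamp>
    return f"{salt(device_id)}|{device_id}|{ts_ms:013d}"

def index_rowkey(device_id: str, metric: str, value: float, ts_ms: int) -> str:
    # The index row reuses the salt+device prefix of the main row, so both fall
    # into the same RowKey range, hence the same Region: resolving an index hit
    # to its main record then needs no cross-node lookup.
    return f"{salt(device_id)}|{device_id}|idx|{metric}|{value:012.3f}|{ts_ms:013d}"

print(main_rowkey("dev42", 1650000000000))
print(index_rowkey("dev42", "temperature", 87.5, 1650000000000))
```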

Text semantic de-duplication algorithm based on keyword graph representation
Jinyun WANG, Yang XIANG
Journal of Computer Applications    2023, 43 (10): 3070-3076.   DOI: 10.11772/j.issn.1001-9081.2022101495

Networks contain large numbers of redundant texts with the same or similar semantics. Text de-duplication addresses the storage space wasted by redundant texts and reduces unnecessary consumption in information extraction tasks. Traditional text de-duplication algorithms rely on literal overlap and make no use of the semantic information of texts; moreover, they cannot capture interactions between sentences that are far apart in a long text, so their de-duplication effect is not ideal. Aiming at the problem of text semantic de-duplication, a long-text de-duplication algorithm based on keyword graph representation was proposed. Firstly, each text pair was represented as a graph whose vertices are the semantic keyword phrases extracted from the pair. Secondly, the nodes were encoded in several ways, and a Graph Attention Network (GAT) was used to learn the relationships between nodes, yielding a vector representation of the text pair's graph from which the semantic similarity of the two texts was judged. Finally, de-duplication was performed according to the semantic similarity of the text pair. Compared with traditional methods, this method exploits the semantic information of texts effectively and, through the graph structure, connects distant sentences of a long text via the co-occurrence of keyword phrases, increasing the semantic interaction between different sentences. Experimental results show that the proposed algorithm outperforms traditional algorithms such as Simhash, BERT (Bidirectional Encoder Representations from Transformers) fine-tuning, and Concept Interaction Graph (CIG) on both the CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story) datasets, with F1 scores of 84.65% on CNSE and 90.76% on CNSS, indicating that the proposed algorithm improves text de-duplication effectively.
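As a rough illustration of the graph construction step, the sketch below builds a keyword co-occurrence graph for a text pair and encodes it with a GAT layer. The keyword extractor, node features, and all hyperparameters are toy assumptions; the paper's actual keyphrase extraction and node encodings are richer.

```python
import itertools
import torch
from torch_geometric.nn import GATConv

def build_keyword_graph(sentences_a, sentences_b, keywords):
    """Vertices are keyword phrases; edges link phrases co-occurring in a
    sentence, which connects sentences that are far apart in a long text."""
    nodes, edges = {}, set()
    for sent in itertools.chain(sentences_a, sentences_b):
        ids = {nodes.setdefault(k, len(nodes)) for k in keywords(sent)}
        edges.update(itertools.combinations(sorted(ids), 2))
    return nodes, edges

kw = lambda s: [w.lower() for w in s.split() if len(w) > 4]  # toy extractor
nodes, edges = build_keyword_graph(
    ["The storm damaged coastal villages"],
    ["Coastal villages were damaged badly"], kw)

x = torch.randn(len(nodes), 16)                         # toy node features
ei = torch.tensor(sorted(edges), dtype=torch.long).t()
ei = torch.cat([ei, ei.flip(0)], dim=1)                 # make edges undirected
h = GATConv(16, 8, heads=2)(x, ei)                      # attention over keyword neighbours
pair_vec = h.mean(dim=0)                                # pooled graph representation
```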

Chinese event detection based on data augmentation and weakly supervised adversarial training
Ping LUO, Ling DING, Xue YANG, Yang XIANG
Journal of Computer Applications    2022, 42 (10): 2990-2995.   DOI: 10.11772/j.issn.1001-9081.2021081521

Existing event detection models rely heavily on human-annotated data, and supervised deep learning models for event detection often overfit when only limited labeled data are available, while methods that replace time-consuming human annotation with auto-labeled data typically rely on sophisticated pre-defined rules. To address these issues, a BERT (Bidirectional Encoder Representations from Transformers) based Mix-text ADversarial training (BMAD) method for Chinese event detection was proposed. The method sets up a weakly supervised learning scenario on the basis of data augmentation and adversarial learning, and uses a span extraction model to solve the event detection task. Firstly, to relieve the problem of insufficient data, data augmentation methods such as back-translation and Mix-Text were applied to augment the data and create the weakly supervised learning scenario for event detection. Then, an adversarial training mechanism was applied to learn with noise and improve the robustness of the whole model. Experiments were conducted on the widely used real-world dataset Automatic Content Extraction (ACE) 2005. The results show that, compared with algorithms such as Nugget Proposal Network (NPN), Trigger-aware Lattice Neural Network (TLNN) and Hybrid-Character-Based Neural Network (HCBNN), the proposed method improves the F1 score by at least 0.84 percentage points.
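The adversarial-training component can be pictured with an FGM-style perturbation of the embedding matrix, shown below as a hedged sketch; BMAD's actual mechanism may differ, and `model.embedding` is an assumed attribute.

```python
import torch

def fgm_step(model, inputs, labels, loss_fn, optimizer, epsilon=1.0):
    """One step of FGM-style adversarial training: perturb the word-embedding
    weights along the loss gradient, accumulate the adversarial gradients,
    then restore the weights. `model.embedding` is assumed to be nn.Embedding."""
    optimizer.zero_grad()
    loss_fn(model(inputs), labels).backward()             # gradients of the clean loss
    emb = model.embedding.weight
    delta = epsilon * emb.grad / (emb.grad.norm() + 1e-12)
    emb.data += delta                                     # worst-case embedding noise
    loss_fn(model(inputs), labels).backward()             # add adversarial gradients
    emb.data -= delta                                     # restore the weights
    optimizer.step()                                      # update on clean + adversarial grads
```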

Collaborative filtering and recommendation algorithm based on matrix factorization and user nearest neighbor model
YANG Yang, XIANG Yang, XIONG Lei
Journal of Computer Applications    2012, 32 (02): 395-398.   DOI: 10.3724/SP.J.1087.2012.00395
To address the data sparsity and new-user problems common to many collaborative recommendation algorithms, a new collaborative recommendation algorithm based on matrix factorization and user nearest neighbors was proposed. To guarantee prediction accuracy for new users, a user nearest-neighbor model based on user data and profile information was used. Meanwhile, because large datasets and matrix sparsity significantly increase time and space complexity, matrix factorization was introduced to alleviate these data problems and improve prediction accuracy. Experimental results show that the new algorithm improves recommendation accuracy effectively and alleviates both the data sparsity and the new-user problems.
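The combination can be illustrated with the sketch below: matrix factorization fitted by SGD on observed ratings, plus a profile-based nearest-neighbor fallback for users who have not rated anything yet. Shapes, hyperparameters, and the similarity source are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def train_mf(R, k=8, lr=0.01, reg=0.05, epochs=50, seed=0):
    """Factorize a ratings matrix R (np.nan marks missing entries) as P @ Q.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(0.0, 0.1, (n_users, k))
    Q = rng.normal(0.0, 0.1, (n_items, k))
    users, items = np.where(~np.isnan(R))
    for _ in range(epochs):
        for u, i in zip(users, items):        # SGD over observed ratings only
            err = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def predict_new_user(profile_sims, R, item):
    """Score an item for a brand-new user from profile similarities to
    existing users (one weight per user), sidestepping the cold start."""
    rated = ~np.isnan(R[:, item])
    w = profile_sims[rated]
    return (w @ R[rated, item]) / (np.abs(w).sum() + 1e-12)
```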
Research on topic maps-based ontology information retrieval model
LI QingMao XingJiang Yang Xiang-Bing Zhou
Journal of Computer Applications    2010, 30 (1): 240-242.  
Ontology is normative, explicit, and reusable in defining domain concepts, so it can be combined with topic maps to organize information resources for semantic navigation. An information retrieval model based on topic maps and ontology was proposed and formally defined. Firstly, a domain of tourism documents was specified. Secondly, the ontology and topic maps of tourism documents were defined to normalize the natural-language queries that users input directly and to identify the users' real search intent, thereby expanding the users' semantic search. The effect of the ontology was then analyzed, showing valuable functions of semantic navigation and of sorting the retrieval results by correlation with the user's query. Finally, experimental results show that the topic maps-based ontology information retrieval model performs better than the traditional model.
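A toy sketch of the query normalization and expansion idea follows: a miniature hand-built tourism "ontology" maps surface terms in a natural-language query to canonical and related topics, widening retrieval. The ontology entries are invented purely for illustration.

```python
# hypothetical mini-ontology: surface term -> canonical topic plus related topics
TOURISM_ONTOLOGY = {
    "hotel": {"canonical": "accommodation", "related": ["hostel", "resort"]},
    "beach": {"canonical": "coastal attraction", "related": ["seaside", "bay"]},
}

def expand_query(query: str) -> set:
    """Normalize a free-text query and expand it with ontology topics."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for t in terms:
        entry = TOURISM_ONTOLOGY.get(t)
        if entry:
            expanded.add(entry["canonical"])
            expanded.update(entry["related"])
    return expanded

print(expand_query("cheap hotel near beach"))
```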
Research of concept cluster based ontology mapping
Wen-tao LV, Yang XIANG, Bo ZHANG
Journal of Computer Applications   
Ontology heterogeneity is a major bottleneck for ontology application, and ontology mapping is the basis for integrating heterogeneous ontologies. Concept Cluster based Ontology Mapping (CCOM) uses the structural information of concepts in ontology mapping, replacing concept-to-concept similarity with concept-cluster similarity when reasoning over mapping rules. Experimental results show that CCOM achieves good recall and precision.
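The cluster-for-concept substitution can be sketched as below: rather than comparing two concepts directly, compare the clusters formed by each concept and its structural neighbours, aggregating pairwise similarities. The string-based similarity here is a stand-in for whatever measure CCOM actually uses.

```python
from difflib import SequenceMatcher

def concept_sim(a: str, b: str) -> float:
    """Stand-in lexical similarity between two concept names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_sim(cluster_a, cluster_b):
    """Average best-match similarity between two concept clusters
    (a concept plus its parents/children drawn from each ontology)."""
    scores = [max(concept_sim(a, b) for b in cluster_b) for a in cluster_a]
    return sum(scores) / len(scores)

print(cluster_sim(["car", "vehicle", "sedan"], ["automobile", "vehicle", "coupe"]))
```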
Online tracing Petri dish of large scale worm
Qiang LI, Jian KANG, Yang XIANG
Journal of Computer Applications   
Outbreaks of network worms cause tremendous damage to the Internet, and launching worm defense and response can improve the network's anti-strike capability. Tracing a worm's propagation path after an outbreak can reconstruct not only the earliest infected nodes but also the timing order in which victims were infected. For detecting and defending against large-scale Internet worm outbreaks, a convenient and safe experimental environment capable of running real worms is important for observing large-scale worm infection, intrusion, and propagation, and it can serve as a large-scale worm testbed for forensic evidence. This paper presents a large-scale worm propagation experiment environment for tracing algorithms: an isolated environment in which the related experiments can be run. To conform as closely as possible to a real network, the environment uses virtual machine technology to simulate a large number of hosts and network devices. Driven by actual worm samples, the environment can trigger large-scale worm outbreaks within a humanly controllable scope, observe the worm's propagation process, test detection and defense techniques, discover propagation characteristics such as scanning methods, collect network traffic and the propagation process in real time, investigate the traffic, and run inference algorithms that reconstruct the worm's patient zero and propagation path. The captured propagation process of the actual worm can then be compared with the results of the tracing algorithm.
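The kind of path inference such a testbed supports can be sketched as follows: given captured flow records (time, source, destination) on the worm's ports, attribute each victim's infection to its earliest inbound contact from an already-infected host. The real tracing algorithm is more sophisticated; the flow records below are fabricated for illustration.

```python
def reconstruct_tree(flows, patient_zero):
    """flows: iterable of (t, src, dst); returns host -> inferred infector."""
    infected = {patient_zero: None}
    for t, src, dst in sorted(flows):          # process contacts in time order
        if src in infected and dst not in infected:
            infected[dst] = src                # earliest plausible infector wins
    return infected

flows = [(1, "A", "B"), (2, "C", "D"), (3, "B", "C"), (4, "C", "D")]
print(reconstruct_tree(flows, "A"))   # {'A': None, 'B': 'A', 'C': 'B', 'D': 'C'}
```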